
model: (qwen3next) correct vectorized key_gdiff calculation#19324

Merged
ngxson merged 2 commits into ggml-org:master from ngxson:xsn/qwen3_next_key_gdiff
Feb 4, 2026

Conversation

@ngxson
Contributor

@ngxson ngxson commented Feb 4, 2026

Testing with the provided prompt from #19305

image

@ngxson
Contributor Author

ngxson commented Feb 4, 2026

Quite fun: after applying 4bfbf0b, I asked the model to identify the bug (giving it the code from before that commit). It successfully identified the problem and even suggested one more improvement (commit d871ac8)

image

@Mushoz

Mushoz commented Feb 4, 2026

It looks like we've officially arrived at self-improving AI ;)

@ggerganov
Member

My test cases that were failing before are now passing with this change.

@github-actions github-actions bot added the model Model specific label Feb 4, 2026
@ngxson
Contributor Author

ngxson commented Feb 4, 2026

I updated the compare-logprobs script and reran it. There are still some divergences from vLLM (I suppose due to numerical issues), but it does look better at long context (see tokens past depth 5000):

PR

| idx | logits_llama.log | logprob_1 | logits_vllm.log | logprob_2 | diff (abs) |
|-----|------------------|-----------|-----------------|-----------|------------|
| 1 | `' '` | -3.0408 | `' '` | -3.0440 | 0.0033 |
| 2 | `'\n\n'` | -0.6087 | `'\n\n'` | -0.5918 | 0.0170 |
| 3 | `' API'` | -0.7177 | `' API'` | -0.8431 | 0.1254 |
| 4 | `' lightweight'` | -0.2557 | `' lightweight'` | -0.2838 | 0.0281 |
| 5 | `' and'` | -0.1517 | `' and'` | -0.1594 | 0.0077 |
| 6 | `' C'` | -0.0635 | `' C'` | -0.0332 | 0.0302 |
| 7 | `' HTTP'` | -0.0113 | `' HTTP'` | -0.0080 | 0.0032 |
| 8 | `' server'` | -0.0037 | `' server'` | -0.0066 | 0.0029 |
| 9 | `' based'` | -0.0240 | `' based'` | -0.0691 | 0.0451 |
| 10 | `' on'` | -0.0000 | `' on'` | -0.0000 | 0.0000 |
| 1011 | `' GPU'` | -1.0844 | `' GPU'` | -1.1533 | 0.0689 |
| 1012 | `' parameters'` | -0.0969 | `' parameters'` | -0.1143 | 0.0175 |
| 1013 | `' to'` | -0.2201 | `' to'` | -0.1712 | 0.0488 |
| 1014 | `' fit'` | -0.0660 | `' fit'` | -0.0926 | 0.0266 |
| 1015 | `' model'` | -0.1862 | `' model'` | -0.3159 | 0.1297 |
| 1016 | `' available'` | -0.3698 | `' available'` | -0.5154 | 0.1456 |
| 1017 | `' memory'` | -0.0490 | `' memory'` | -0.0509 | 0.0019 |
| 1018 | `' ('` | -0.0223 | `' ('` | -0.0401 | 0.0178 |
| 1019 | `' or'` | -0.1865 | `' or'` | -0.3119 | 0.1253 |
| 1020 | `' ''` | -0.0016 | `' ''` | -0.0011 | 0.0005 |
| 5021 | `' tokens'` | -0.0002 | `' tokens'` | -0.0001 | 0.0001 |
| 5022 | `' at'` | -0.0000 | `' at'` | -0.0000 | 0.0000 |
| 5023 | `' a'` | -0.6503 | `' minimum'` | -0.6290 | 0.0213 |
| 5024 | `' Default'` | -0.0021 | `' Default'` | -0.0005 | 0.0015 |
| 5025 | `` ' `' `` | -0.0000 | `` ' `' `` | -0.0000 | 0.0000 |
| 5026 | `' Set'` | -0.7499 | `' Time'` | -0.6461 | 0.1037 |
| 5027 | `' a'` | -0.1898 | `' a'` | -0.2087 | 0.0189 |
| 5028 | `' time'` | -0.0009 | `' time'` | -0.0005 | 0.0003 |
| 5029 | `' limit'` | -0.0007 | `' limit'` | -0.0009 | 0.0002 |
| 5030 | `' for'` | -0.6250 | `' ('` | -0.7296 | 0.1046 |

master

| idx | logits_llama.log | logprob_1 | logits_vllm.log | logprob_2 | diff (abs) |
|-----|------------------|-----------|-----------------|-----------|------------|
| 1 | `' '` | -3.0408 | `' '` | -3.0440 | 0.0033 |
| 2 | `'\n\n'` | -0.6088 | `'\n\n'` | -0.5918 | 0.0170 |
| 3 | `' API'` | -0.8385 | `' API'` | -0.8431 | 0.0046 |
| 4 | `' lightweight'` | -0.2408 | `' lightweight'` | -0.2838 | 0.0430 |
| 5 | `' pure'` | -0.6919 | `' and'` | -0.1594 | 0.5325 |
| 6 | `' C'` | -0.0190 | `' C'` | -0.0332 | 0.0142 |
| 7 | `' HTTP'` | -0.0373 | `' HTTP'` | -0.0080 | 0.0293 |
| 8 | `' server'` | -0.0021 | `' server'` | -0.0066 | 0.0044 |
| 9 | `' based'` | -0.6722 | `' based'` | -0.0691 | 0.6031 |
| 10 | `' on'` | -0.0001 | `' on'` | -0.0000 | 0.0000 |
| 1011 | `' GPU'` | -1.3448 | `' GPU'` | -1.1533 | 0.1915 |
| 1012 | `' GPU'` | -0.6565 | `' parameters'` | -0.1143 | 0.5422 |
| 1013 | `' to'` | -0.7436 | `' to'` | -0.1712 | 0.5724 |
| 1014 | `' fit'` | -0.1738 | `' fit'` | -0.0926 | 0.0812 |
| 1015 | `' model'` | -0.6253 | `' model'` | -0.3159 | 0.3094 |
| 1016 | `' available'` | -0.9195 | `' available'` | -0.5154 | 0.4041 |
| 1017 | `' memory'` | -0.0584 | `' memory'` | -0.0509 | 0.0075 |
| 1018 | `' ('` | -0.0105 | `' ('` | -0.0401 | 0.0296 |
| 1019 | `'/''` | -0.9424 | `' or'` | -0.3119 | 0.6305 |
| 1020 | `' ''` | -0.0020 | `' ''` | -0.0011 | 0.0009 |
| 5021 | `' tokens'` | -0.0002 | `' tokens'` | -0.0001 | 0.0001 |
| 5022 | `' at'` | -0.0004 | `' at'` | -0.0000 | 0.0004 |
| 5023 | `' minimum'` | -0.2551 | `' minimum'` | -0.6290 | 0.3738 |
| 5024 | `' Default'` | -0.0005 | `' Default'` | -0.0005 | 0.0000 |
| 5025 | `` ' `' `` | -0.0000 | `` ' `' `` | -0.0000 | 0.0000 |
| 5026 | `' Set'` | -0.0209 | `' Time'` | -0.6461 | 0.6252 |
| 5027 | `' a'` | -0.3147 | `' a'` | -0.2087 | 0.1061 |
| 5028 | `' time'` | -0.0003 | `' time'` | -0.0005 | 0.0002 |
| 5029 | `' limit'` | -0.0062 | `' limit'` | -0.0009 | 0.0053 |
| 5030 | `' in'` | -0.5323 | `' ('` | -0.7296 | 0.1973 |
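For reference, the diff column in the tables above can be reproduced with a small sketch along these lines (hypothetical helper; the actual compare-logprobs script and its log format may differ):

```python
# Hypothetical sketch: per-position absolute differences between the
# top-token logprobs reported by two engines (llama.cpp vs vLLM).
def compare_logprobs(lp_a, lp_b):
    """Return the absolute difference at each token position."""
    assert len(lp_a) == len(lp_b)
    return [abs(a - b) for a, b in zip(lp_a, lp_b)]

# First three rows of the "PR" table above:
llama = [-3.0408, -0.6087, -0.7177]
vllm  = [-3.0440, -0.5918, -0.8431]
diffs = compare_logprobs(llama, vllm)

# Note: the table's diff column was computed from unrounded logprobs,
# so the last digit can differ from what these rounded values give.
print([round(d, 4) for d in diffs])
```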

@ngxson ngxson merged commit 8abcc70 into ggml-org:master Feb 4, 2026
66 of 75 checks passed
@Mushoz

Mushoz commented Feb 4, 2026

It does still deviate in the token that was picked at position 5030. Shouldn't numerical precision issues still result in the same token?

@ngxson
Contributor Author

ngxson commented Feb 4, 2026

Not always; numerical differences can accumulate enough to change the output logits. But I think it may also depend on the quantization I'm using (q8_0). Will need to do more testing, but for now I think the current fix should already be good enough.
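A minimal illustration of the point (not tied to this PR's kernels, just standard floating-point behavior): the same sum, evaluated in a different order, can round differently, and near-tied logits then flip under greedy sampling.

```python
import numpy as np

# The same sum in a different order rounds differently in float32:
a = np.float32(1e8)
b = np.float32(1.0)
left = (a + b) - a   # 1e8 + 1 rounds back to 1e8 in float32, so this is 0.0
right = b + (a - a)  # exact, so this is 1.0
print(left, right)   # 0.0 1.0

# When two logits are nearly tied, an error of this size is enough to
# change which token has the larger logit (and thus gets picked greedily):
logits = np.array([2.0, 2.0 + 1e-7])
print(int(np.argmax(logits)))                 # 1
print(int(np.argmax(logits + [2e-7, 0.0])))   # 0: a tiny perturbation flips the pick
```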

@ggerganov
Member

Btw, it's quite funny watching the ollama bros copy-pasting our bugs into their "new engine" 🤣. Let's see how long it will take them to realize.

https://github.com/ollama/ollama/pull/14051/changes#diff-1b8f23e564159d80674c3a97ca9f02489ad6ee90bf956f4ed92062811a6be0e5R447-R453

image

@CISC
Member

CISC commented Feb 4, 2026

Btw, it's quite funny watching the ollama bros copy-pasting our bugs into their "new engine" 🤣. Let's see how long it will take them to realize.

That's the only reason we write buggy code, right? * cough *

@CISC
Member

CISC commented Feb 4, 2026

@ngxson
Contributor Author

ngxson commented Feb 4, 2026

@ngxson https://github.com/ggml-org/llama.cpp/actions/runs/21670903144/job/62478218069

hmm ok, I ran editorconfig locally but it didn't catch this earlier; I was probably on the wrong branch. Pushing a fix along with #19331

@fizzAI

fizzAI commented Feb 4, 2026

Btw, it's quite funny watching the ollama bros copy-pasting our bugs into their "new engine" 🤣. Let's see how long it will take them to realize.

https://github.com/ollama/ollama/pull/14051/changes#diff-1b8f23e564159d80674c3a97ca9f02489ad6ee90bf956f4ed92062811a6be0e5R447-R453
image

I 100% bet you they're just vibe-translating LCPP PRs to Go. lollllll

@Mushoz

Mushoz commented Feb 4, 2026

Do GGUFs need to be regenerated after this change? I was under the impression that wouldn't be needed, but this message by the Unsloth team tells me that I do: https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/discussions/5

@CISC
Member

CISC commented Feb 4, 2026

Do GGUFs need to be regenerated after this change? I was under the impression that wouldn't be needed, but this message by the Unsloth team tells me that I do: https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/discussions/5

No, there are no conversion changes; no idea why they reconverted the model.

@ngxson
Contributor Author

ngxson commented Feb 4, 2026

It should only affect I-quants, since the imatrix is generated from intermediate activations.

Normal quants (Qx_0, Qx_1, Qx_K) should not be affected.

@CISC
Member

CISC commented Feb 4, 2026

Ah, yes, imatrix would be affected.

@danielhanchen
Contributor

Oh yes, Q8_K_XL, Q8_0, BF16, and MXFP4_MOE are fine; the rest are imatrix quants, so they did change a bit

liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
…#19324)

* model: (qwen3next) correct vectorized key_gdiff calculation

* move transpose to outside of loop

Labels

model Model specific


6 participants